Housing Price Prediction Research Reproduction

Authors

Oksana Chubatenko, Oliwia Iwańska, Andrea Furmanek, Aleksander Karbowiak

Housing Price Prediction via Improved Machine Learning Techniques

A reproduction of the research by Quang Truong, Minh Nguyen, Hy Dang and Bo Mei

The purpose of the original article was to build a model predicting housing prices in Beijing and to determine the impact of various property characteristics on prices. For our reproduction we used the R language and its libraries (the original analysis was carried out in Python).

Data

We read the data with "gbk" encoding (used for the Chinese characters in the file) and take a first look at the columns.

```{r}
#libraries used throughout the analysis
library(dplyr)
library(ggplot2)
library(reshape2)   # melt()
library(corrplot)
library(geosphere)  # distHaversine()
library(caret)
library(lightgbm)
library(Metrics)    # rmsle(), mape()

#read csv as dataframe; fileEncoding="gbk" handles the Chinese characters
housing_prices_data <- as.data.frame(read.csv("new.csv", fileEncoding = "gbk", header = TRUE))
str(housing_prices_data)
```
'data.frame':   318851 obs. of  26 variables:
 $ url                : chr  "https://bj.lianjia.com/chengjiao/101084782030.html" "https://bj.lianjia.com/chengjiao/101086012217.html" "https://bj.lianjia.com/chengjiao/101086041636.html" "https://bj.lianjia.com/chengjiao/101086406841.html" ...
 $ id                 : chr  "101084782030" "101086012217" "101086041636" "101086406841" ...
 $ Lng                : num  116 116 117 116 116 ...
 $ Lat                : num  40 39.9 39.9 40.1 39.9 ...
 $ Cid                : num  1.11e+12 1.11e+12 1.11e+12 1.11e+12 1.11e+12 ...
 $ tradeTime          : chr  "2016-08-09" "2016-07-28" "2016-12-11" "2016-09-30" ...
 $ DOM                : num  1464 903 1271 965 927 ...
 $ followers          : int  106 126 48 138 286 57 167 138 218 134 ...
 $ totalPrice         : num  415 575 1030 298 392 ...
 $ price              : int  31680 43436 52021 22202 48396 52000 37672 49521 27917 55883 ...
 $ square             : num  131 132 198 134 81 ...
 $ livingRoom         : chr  "2" "2" "3" "3" ...
 $ drawingRoom        : chr  "1" "2" "2" "1" ...
 $ kitchen            : int  1 1 1 1 1 1 1 1 1 0 ...
 $ bathRoom           : chr  "1" "2" "3" "1" ...
 $ floor              : chr  "高 26" "高 22" "中 4" "底 21" ...
 $ buildingType       : num  1 1 4 1 4 4 4 1 3 1 ...
 $ constructionTime   : chr  "2005" "2004" "2005" "2008" ...
 $ renovationCondition: int  3 4 3 1 2 3 4 4 1 4 ...
 $ buildingStructure  : int  6 6 6 6 2 6 2 6 2 6 ...
 $ ladderRatio        : num  0.217 0.667 0.5 0.273 0.333 0.333 0.5 0.667 0.333 0.308 ...
 $ elevator           : num  1 1 1 1 0 1 0 1 0 1 ...
 $ fiveYearsProperty  : num  0 1 0 0 1 1 0 1 0 1 ...
 $ subway             : num  1 0 0 0 1 0 0 0 0 1 ...
 $ district           : int  7 7 7 6 1 7 7 7 13 1 ...
 $ communityAverage   : num  56021 71539 48160 51238 62588 ...

Data cleaning

In the original article, the attributes describing the number of kitchens, bathrooms and drawing rooms were removed due to their ambiguity. We decided to keep them and give them a chance in the further analysis. Columns such as "id" and "url" were removed. As in the article, we removed the variable "DOM" ("Days on Market"), because as many as 49.5% of the observations had missing values for this variable. Also following the article, we removed records containing any null values; the percentage of such cases was very low, so this decision did not significantly affect further work.

```{r}
#drop unnecessary columns
drop_cols <- c("url","id")
housing_prices_data <- housing_prices_data[ , !(names(housing_prices_data) %in% drop_cols)]

# share of missing values in each column (% of the whole dataset)
colMeans(is.na(housing_prices_data))*100


housing_prices_data <- housing_prices_data[ , !(names(housing_prices_data) %in% "DOM")]

#Now we are going to check how many rows have missing values

obs_with_nulls <- housing_prices_data[!complete.cases(housing_prices_data),]


#Distributions of the full dataset and of the rows with NULLs are similar, and there are only 2403 such rows.
#We can therefore safely delete the rows with NULLs.
housing_prices_data_clean <- na.omit(housing_prices_data)
```
                Lng                 Lat                 Cid           tradeTime 
         0.00000000          0.00000000          0.00000000          0.00000000 
                DOM           followers          totalPrice               price 
        49.54571257          0.00000000          0.00000000          0.00000000 
             square          livingRoom         drawingRoom             kitchen 
         0.00000000          0.00000000          0.00000000          0.00000000 
           bathRoom               floor        buildingType    constructionTime 
         0.00000000          0.00000000          0.63383838          0.00000000 
renovationCondition   buildingStructure         ladderRatio            elevator 
         0.00000000          0.00000000          0.00000000          0.01003604 
  fiveYearsProperty              subway            district    communityAverage 
         0.01003604          0.01003604          0.00000000          0.14520889 

Data preprocessing

In preparing the data for modeling, we performed the necessary variable transformations. From the variable "floor" we extracted the floor height and, unlike the original article, dropped the floor type: the type conveys the same information as the numeric height, so it makes no sense to keep both variables. For records with an unknown construction time we imputed the mean of "constructionTime" before computing the derived "age" variable. We also collapsed categories of variables such as "renovationCondition" to reduce the number of levels with similar or nearly identical business meaning. We created additional variables that were not part of the original article, such as "avgRoomSize", which describes the average size of the usable rooms in a given apartment. Finally, we treated outliers using the inter-quartile range (IQR).
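The floor split can be sketched on a toy vector first (a regex-based sketch with made-up values; the first token of each string is the floor type, the rest is the numeric height):

```{r}
# Toy sketch of splitting "floor" strings: the first token is the type
# ("高", "中", "底", "未知"), the rest is the numeric height
floor_raw  <- c("高 26", "中 4", "底 21", "未知 9")
floor_type <- sub("\\s.*$", "", floor_raw)
floor_num  <- as.integer(sub("^\\S+\\s*", "", floor_raw))
floor_num   # 26 4 21 9
```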

```{r}
summary(housing_prices_data_clean)
#converting some variables to number
housing_prices_data_clean$livingRoom <- as.integer(housing_prices_data_clean$livingRoom)
housing_prices_data_clean$drawingRoom <- as.integer(housing_prices_data_clean$drawingRoom)
housing_prices_data_clean$bathRoom <- as.integer(housing_prices_data_clean$bathRoom)

#Split "floor" into type (characters) and numeric height, e.g. "高 26" -> type / 26
housing_prices_data_clean$floorType <- substring(housing_prices_data_clean$floor,1,2)
housing_prices_data_clean$floorNum <- as.integer(substring(housing_prices_data_clean$floor, 3))
housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% "floor")]

# Floor type translations - the type is just a coarse grouping of floorNum, so it can be removed later
# 高 - High
# 未知 - Unknown
# 顶  - Top
# 低 - Low
# 底 - Bottom
# 中 - Medium

#Calculation of distance between home and Beijing city center (Forbidden City coordinates in our case)
BeijingCenterLat <- 39.91690639140218
BeijingCenterLng <- 116.39716443298232
#Haversine distance to get result in kilometers
housing_prices_data_clean$distance <- distHaversine(p1=housing_prices_data_clean[,c("Lng","Lat")],c(BeijingCenterLng,BeijingCenterLat))/1000

#Age = current year - construction year
current_year <- as.numeric(format(Sys.Date(),"%Y"))

#Check frequency
table(housing_prices_data_clean$constructionTime)

#If the construction time is unknown, impute the mean construction year
meanConstrTime <- round(mean(as.integer(housing_prices_data_clean$constructionTime),na.rm = TRUE))

housing_prices_data_clean$age <- ifelse(housing_prices_data_clean$constructionTime=='未知',
                                        current_year - meanConstrTime,
                                        current_year - as.integer(housing_prices_data_clean$constructionTime))
housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% "constructionTime")]

##converting numeric codes to categorical factors
housing_prices_data_clean$buildingType <- ifelse(housing_prices_data_clean$buildingType==1,"Tower",ifelse(housing_prices_data_clean$buildingType==2,"Bungalow",ifelse(housing_prices_data_clean$buildingType==3,"Plate and Tower","Plate")))
# "Bungalow" buildings are removed from the dataset: they are outliers and fundamentally different from the other types
housing_prices_data_clean <- housing_prices_data_clean[housing_prices_data_clean$buildingType != "Bungalow", ]
housing_prices_data_clean$buildingType <- as.factor(housing_prices_data_clean$buildingType)

housing_prices_data_clean$renovationCondition <- ifelse(housing_prices_data_clean$renovationCondition==1,"Other",ifelse(housing_prices_data_clean$renovationCondition==2,"Rough",ifelse(housing_prices_data_clean$renovationCondition==3,"Simplicity","Hardcover")))
# The "Rough" renovation condition is merged into "Hardcover"
housing_prices_data_clean$renovationCondition <- as.factor(ifelse(housing_prices_data_clean$renovationCondition=="Rough","Hardcover",housing_prices_data_clean$renovationCondition))
  
housing_prices_data_clean$buildingStructure <- ifelse(housing_prices_data_clean$buildingStructure==1,"Unknown",ifelse(housing_prices_data_clean$buildingStructure==2,"Mixed",ifelse(housing_prices_data_clean$buildingStructure==3,"Brick and wood",ifelse(housing_prices_data_clean$buildingStructure==4,"Brick and concrete",ifelse(housing_prices_data_clean$buildingStructure==5,"Steel","Steel-concrete composite")))))
##Dealing with rare levels: "Steel" is merged into "Steel-concrete composite"; "Brick and wood" and "Unknown" into "Mixed"
housing_prices_data_clean$buildingStructure <- as.factor(ifelse(housing_prices_data_clean$buildingStructure=="Steel","Steel-concrete composite",ifelse(housing_prices_data_clean$buildingStructure=="Brick and wood","Mixed",ifelse(housing_prices_data_clean$buildingStructure=="Unknown","Mixed",housing_prices_data_clean$buildingStructure))))

#housing_prices_data_clean$floorType <- as.factor(housing_prices_data_clean$floorType)

housing_prices_data_clean$elevator <- ifelse(housing_prices_data_clean$elevator==1,"Elevator","noElevator")
housing_prices_data_clean$elevator <- as.factor(housing_prices_data_clean$elevator)
housing_prices_data_clean$fiveYearsProperty <- ifelse(housing_prices_data_clean$fiveYearsProperty==1,"isFiveYProp","noFiveYProp")
housing_prices_data_clean$fiveYearsProperty <- as.factor(housing_prices_data_clean$fiveYearsProperty)
housing_prices_data_clean$subway <- ifelse(housing_prices_data_clean$subway==1,"Subway","NoSubway")
housing_prices_data_clean$subway <- as.factor(housing_prices_data_clean$subway)
housing_prices_data_clean$district <- as.factor(housing_prices_data_clean$district)

#char to Date
housing_prices_data_clean$tradeTime <- as.Date(housing_prices_data_clean$tradeTime)

##New feature - average size of a room
housing_prices_data_clean$avgRoomSize <- housing_prices_data_clean$square/(housing_prices_data_clean$livingRoom + housing_prices_data_clean$drawingRoom + housing_prices_data_clean$kitchen + housing_prices_data_clean$bathRoom)

housing_prices_data_clean$totalPrice <- housing_prices_data_clean$totalPrice * 10000 #real scale

#########
## INFO:
## totalPrice = price (average price per square meter) * square, times 10 000
## Currency is yuan (CNY)
########
```
      Lng             Lat             Cid             tradeTime        
 Min.   :116.1   Min.   :39.63   Min.   :1.111e+12   Length:316448     
 1st Qu.:116.3   1st Qu.:39.89   1st Qu.:1.111e+12   Class :character  
 Median :116.4   Median :39.93   Median :1.111e+12   Mode  :character  
 Mean   :116.4   Mean   :39.95   Mean   :1.125e+12                     
 3rd Qu.:116.5   3rd Qu.:40.00   3rd Qu.:1.111e+12                     
 Max.   :116.7   Max.   :40.25   Max.   :1.185e+14                     
   followers         totalPrice         price            square      
 Min.   :   0.00   Min.   :   0.1   Min.   :     1   Min.   :  7.37  
 1st Qu.:   0.00   1st Qu.: 205.0   1st Qu.: 28090   1st Qu.: 57.90  
 Median :   5.00   Median : 294.0   Median : 38778   Median : 74.16  
 Mean   :  16.71   Mean   : 347.9   Mean   : 43550   Mean   : 82.84  
 3rd Qu.:  18.00   3rd Qu.: 425.0   3rd Qu.: 53857   3rd Qu.: 98.48  
 Max.   :1143.00   Max.   :4900.0   Max.   :156250   Max.   :640.00  
  livingRoom        drawingRoom           kitchen         bathRoom        
 Length:316448      Length:316448      Min.   :0.0000   Length:316448     
 Class :character   Class :character   1st Qu.:1.0000   Class :character  
 Mode  :character   Mode  :character   Median :1.0000   Mode  :character  
                                       Mean   :0.9946                     
                                       3rd Qu.:1.0000                     
                                       Max.   :3.0000                     
    floor            buildingType  constructionTime   renovationCondition
 Length:316448      Min.   :1.00   Length:316448      Min.   :1.000      
 Class :character   1st Qu.:1.00   Class :character   1st Qu.:1.000      
 Mode  :character   Median :4.00   Mode  :character   Median :3.000      
                    Mean   :3.01                      Mean   :2.607      
                    3rd Qu.:4.00                      3rd Qu.:4.000      
                    Max.   :4.00                      Max.   :4.000      
 buildingStructure  ladderRatio          elevator      fiveYearsProperty
 Min.   :1.000     Min.   :       0   Min.   :0.0000   Min.   :0.0000   
 1st Qu.:2.000     1st Qu.:       0   1st Qu.:0.0000   1st Qu.:0.0000   
 Median :6.000     Median :       0   Median :1.0000   Median :1.0000   
 Mean   :4.452     Mean   :      64   Mean   :0.5799   Mean   :0.6467   
 3rd Qu.:6.000     3rd Qu.:       0   3rd Qu.:1.0000   3rd Qu.:1.0000   
 Max.   :6.000     Max.   :10009400   Max.   :1.0000   Max.   :1.0000   
     subway          district      communityAverage
 Min.   :0.0000   Min.   : 1.000   Min.   : 10847  
 1st Qu.:0.0000   1st Qu.: 6.000   1st Qu.: 46410  
 Median :1.0000   Median : 7.000   Median : 59025  
 Mean   :0.6026   Mean   : 6.767   Mean   : 63784  
 3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.: 76001  
 Max.   :1.0000   Max.   :13.000   Max.   :183109  

 1950  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962  1963 
   11     5    13    61    61    77    66   124    49   175     8    21    89 
 1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975  1976 
  103   190    93    73     4     4   280     8    32   167   178   382   437 
 1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989 
  501   731  1749  3295  2080  2766  2879  3344  4693  4850  4923  5557  5482 
 1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002 
 8368  4306  8472  6746  7713  9042  9042  6442 11281 10163 13809 10525 12287 
 2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015 
19252 21003 18734 14739 14126 12095 11506  7213  5422  4982  2266  2079   445 
 2016  未知 
   82 18747 
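As a sanity check on the new distance feature, the Haversine formula can be verified by hand in base R (a sketch using the same Earth radius that geosphere::distHaversine uses by default; the reference point is the Forbidden City coordinate from above):

```{r}
# Hand-rolled Haversine distance in kilometers
haversine_km <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a)) / 1000
}
# one hundredth of a degree of latitude is roughly 1.11 km
haversine_km(116.3972, 39.9169, 116.3972, 39.9269)
```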

In addition to using the IQR method to detect outliers, we also analyzed the distributions of the variables.

```{r}
#density plots of selected numeric variables
ggplot(housing_prices_data_clean, aes(x = totalPrice)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

ggplot(housing_prices_data_clean, aes(x = followers)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

ggplot(housing_prices_data_clean, aes(x = floorNum)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

ggplot(housing_prices_data_clean, aes(x = communityAverage)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

ggplot(housing_prices_data_clean, aes(x = square)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

ggplot(housing_prices_data_clean, aes(x = price)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density()

boxplot(housing_prices_data_clean$totalPrice)

#Capping outliers in numerical variables using the IQR rule
IQRcutData <- function(data, lower_quantile = 0.25, upper_quantile = 0.75)
{
  q <- quantile(data, probs = c(lower_quantile, upper_quantile), na.rm = TRUE)
  iqr <- q[2] - q[1]
  upper <- q[2] + 1.5 * iqr
  lower <- q[1] - 1.5 * iqr
  
  # winsorize: values outside the whiskers are capped at the bounds
  data_cut <- data
  data_cut[data < lower] <- lower
  data_cut[data > upper] <- upper
  
  return(data_cut)
}



########################################################## Outliers IQR
cut_totalPrice <-IQRcutData(housing_prices_data_clean$totalPrice)
housing_prices_data_clean$totalPrice <- cut_totalPrice

cut_followers <- IQRcutData(housing_prices_data_clean$followers)
housing_prices_data_clean$followers <- cut_followers

cut_floorNum <- IQRcutData(housing_prices_data_clean$floorNum)
housing_prices_data_clean$floorNum <- cut_floorNum

cut_communityAverage <- IQRcutData(housing_prices_data_clean$communityAverage)
housing_prices_data_clean$communityAverage <- cut_communityAverage

cut_square <- IQRcutData(housing_prices_data_clean$square)
housing_prices_data_clean$square <- cut_square

cut_price <- IQRcutData(housing_prices_data_clean$price)
housing_prices_data_clean$price <- cut_price
```
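To make the capping concrete, here is a minimal, self-contained illustration of the same IQR rule on a toy vector (made-up values):

```{r}
# A minimal illustration of the IQR capping on a toy vector
x <- c(1, 2, 3, 4, 5, 100)          # 100 is an obvious outlier
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
upper <- q[2] + 1.5 * iqr
lower <- q[1] - 1.5 * iqr
x_cut <- pmin(pmax(x, lower), upper)  # cap instead of drop
x_cut   # 1 2 3 4 5 8.5
```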

Data Analysis

As the first step of the data analysis we performed a correlation analysis. In the article, correlations were computed only for selected variables; we approached the issue holistically.

```{r}
#Correlation matrix for numerics
corrData <- housing_prices_data_clean
drop_colsCorr2 <- c("tradeTime","Cid","floorType")
drop_colsCorr <- c("Cid","Lng","Lat","tradeTime","buildingType","renovationCondition","buildingStructure","floorType","elevator","fiveYearsProperty","subway","district")
corrData <- corrData[ , !(names(corrData) %in% drop_colsCorr)]

res <- cor(corrData)
#corrplot(res,method="number")

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(res, method="color", col=col(200),  
         type="upper", order="hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
         # Combine with significance
         sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE , number.cex = 0.4, tl.cex = 0.5
)
```

We can see a strong positive correlation between "totalPrice", "price" and "communityAverage". The new variable "avgRoomSize" is mildly correlated with "square" but essentially uncorrelated with the variables describing the number of rooms, which is a good sign from a modeling perspective. The variables "distance" and "communityAverage" showed a strong negative correlation. We also verified correlations on the one-hot-encoded variables.

```{r}
#One-hot encoding - correlation matrix of all vars

housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% drop_colsCorr2)]
# Convert factor variables to dummy variables
dummy_vars <- lapply(housing_prices_data_clean[, sapply(housing_prices_data_clean, is.factor)], function(x) model.matrix(~ x - 1, data = housing_prices_data_clean))

# Combine dummy variables with numeric variables
housing_prices_data_clean_onehot <- cbind(housing_prices_data_clean[, !sapply(housing_prices_data_clean, is.factor)], do.call(cbind, dummy_vars))
# Calculate correlation matrix
correlation_matrix <- cor(housing_prices_data_clean_onehot)

# Visualize correlation matrix as heatmap
ggplot(data = melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, name="Correlation") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Correlation Plot")
```

The multitude of variables did not make the analysis easy, but interesting conclusions could nevertheless be drawn. Floor height is negatively correlated with certain building types: high floors do not occur in "Plate"-type properties. In addition, the lack of an elevator is strongly negatively correlated with high floors.

We also prepared a standardized version of the dataset for some models.

```{r}
#restoring tradeTime
housing_prices_data_clean_onehot$tradeTime <- housing_prices_data_clean$tradeTime

## Data for modeling

# Custom function to standardize a column (z-score)
standardize_cols <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Standardize all numeric columns (integer columns are numeric as well)
housing_prices_data_clean_stand <- housing_prices_data_clean %>%
  mutate(across(where(is.numeric), standardize_cols))
#one-hot encoding once again

# Convert factor variables to dummy variables
dummy_vars <- lapply(housing_prices_data_clean_stand[, sapply(housing_prices_data_clean_stand, is.factor)], function(x) model.matrix(~ x - 1, data = housing_prices_data_clean_stand))

# Combine dummy variables with numeric variables
housing_prices_data_clean_stand_onehot <- cbind(housing_prices_data_clean_stand[, !sapply(housing_prices_data_clean_stand, is.factor)], do.call(cbind, dummy_vars))
```

Geolocation with the distribution of age and price

```{r}
ggplot(housing_prices_data_clean,aes(x=Lat,y=Lng,group=age))+
   geom_point(aes(color=age)) +
    scale_colour_gradient(low="#66C2A5", high="#FC8D62")
```

The plot above agrees with the original paper: the oldest houses are mostly located in the city center, while younger homes are in the suburbs.

```{r}
ggplot(housing_prices_data_clean,aes(x=Lat,y=Lng,group=price))+
   geom_point(aes(color=price)) +
    scale_colour_gradient(low="#66C2A5", high="#FC8D62")
```

A similar situation as before: the most expensive properties can be found in the city center.

```{r}
ggplot(housing_prices_data_clean, aes(x=district, y=price, fill=district)) + 
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="BuPu")
```

From the plot above we can see that, on average, the most expensive homes are in districts 1, 10 and 8.

```{r}
ggplot(housing_prices_data_clean, aes(x=buildingType, y=price, fill=buildingType)) + 
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="BuPu")
```

```{r}
ggplot(housing_prices_data_clean, aes(x=buildingType, y=square, fill=buildingType)) + 
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="BuPu")
```

Building type does not appear to noticeably influence either the price or the floor area.

Random Forest

Now we reproduce the Random Forest model. The "ranger" method is used for the Random Forest algorithm, with 900 trees and 10 threads for parallel processing to speed up computation. The model's importance measure is set to "impurity". Pre-processing includes centering and scaling the data. Cross-validation with 10 folds is controlled through ctrl_cv10, and the tuning grid fixes the model parameters: 20 variables sampled at each split (mtry), a variance-based splitting rule, and a minimum node size of 100.

Data preparation for modeling and setting the evaluation parameters

```{r}
# Splitting the dataset into train and test sets

set.seed(123456789)

data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice, # target variable
                                        p = 0.8, 
                                        list = FALSE) 

data_train <- housing_prices_data_clean[data_which_train,]
data_test <- housing_prices_data_clean[-data_which_train,]

# function for the model evaluation
regressionMetrics <- function(data,
                              lev = NULL,
                              model = NULL) {
  real <- data$obs
  predicted <- data$pred
  
  # Mean Square Error
  MSE <- mean((real - predicted)^2)
  # Root Mean Square Error
  RMSE <- sqrt(MSE)
  # Mean Absolute Error
  MAE <- mean(abs(real - predicted))
  # Mean Absolute Percentage Error
  MAPE <- mean(abs(real - predicted)/real)
  # Median Absolute Error
  MedAE <- median(abs(real - predicted))
  # Mean Squared Logarithmic Error
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
  # Root Mean Squared Logarithmic Error (RMSLE)
  RMSLE <- sqrt(MSLE)
  # R2
  R2 <- cor(predicted, real)^2
  
  result <- c(MSE = MSE, RMSE = RMSE, MAE = MAE, MAPE = MAPE, MedAE = MedAE,
              MSLE = MSLE, RMSLE = RMSLE, R2 = R2)
  return(result)
}
  
# 10-fold cross validation 
ctrl_cv10 <- trainControl(method = "cv",
                          number = 10,
                          savePredictions = "final",
                          summaryFunction = regressionMetrics)
```
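As a quick, self-contained sanity check of the metric definitions above, MSE and MAPE can be recomputed by hand on made-up values:

```{r}
# Toy check of the metric formulas (values are made up)
real <- c(100, 200, 300)
pred <- c(110, 190, 330)
mse  <- mean((real - pred)^2)          # (100 + 100 + 900) / 3 = 366.67
mape <- mean(abs(real - pred) / real)  # (0.10 + 0.05 + 0.10) / 3 = 0.0833
c(MSE = mse, MAPE = round(mape, 4))
```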

Model Parameters

The parameters for the model are defined based on the article.

```{r}
# set.seed(123456789)
 # data_rf <- 
 #   caret::train(totalPrice ~ ., 
 #         data = data_train,
 #         method = "ranger",
 #         num.trees = 900,
 #         num.threads = 10,
 #         importance = "impurity",
 #         preProcess = c("center", "scale"),
 #         trControl = ctrl_cv10,
 #         tuneGrid = expand.grid(mtry = 20,
 #                                splitrule = "variance",
 #                                min.node.size = 100))


# loading from saved file
data_rf <- readRDS("data_rf_model.rds")
data_rf
```
Random Forest 

253113 samples
    23 predictor

Pre-processing: centered (37), scaled (37) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 227802, 227802, 227800, 227803, 227803, 227802, ... 
Resampling results:

  MSE         RMSE      MAE       MAPE       MedAE     MSLE        RMSLE    
  8786246935  93695.62  29103.36  0.2523139  7882.703  0.02567119  0.1590321
  R2       
  0.9971253

Tuning parameter 'mtry' was held constant at a value of 20
Tuning
 parameter 'splitrule' was held constant at a value of variance

Tuning parameter 'min.node.size' was held constant at a value of 100

The paper's model performs better in terms of RMSLE, i.e. it has a lower prediction error on the logarithmic scale: the paper's RMSLE of 0.12980 is lower than our 0.1590321, so our model's predictions are slightly less accurate on the training data.
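The intuition behind RMSLE, namely that it measures relative rather than absolute error, can be seen on two made-up predictions with the same 10% error:

```{r}
# RMSLE depends on the relative, not the absolute, error
rmsle_toy <- function(real, pred) sqrt(mean((log1p(real) - log1p(pred))^2))
rmsle_toy(100, 110)      # ~0.094
rmsle_toy(10000, 11000)  # ~0.095, despite a 100x larger absolute error
```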

eXtreme Gradient Boosting

Model Parameters

The parameters for the model are defined based on the article.

```{r}
# set.seed(123456789)
# 
#  data_xgboost <- 
#    caret::train(totalPrice ~ ., 
#          data = data_train,
#          method = "xgbTree",
#          preProcess = c("center", "scale"),
#          trControl = ctrl_cv10,
#          tuneGrid = expand.grid(nrounds = 200,           
#                                 max_depth = 5,          
#                                 eta = 0.1,              
#                                 gamma = 0.5,            
#                                 colsample_bytree = 0.8,
#                                 min_child_weight = 2,   
#                                 subsample = 1))

# loading from saved file
data_xgboost <- readRDS("data_xgboost_model.rds")
data_xgboost
```
eXtreme Gradient Boosting 

253113 samples
    23 predictor

Pre-processing: centered (37), scaled (37) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 227802, 227802, 227800, 227803, 227803, 227802, ... 
Resampling results:

  MSE         RMSE      MAE      MAPE       MedAE     MSLE  RMSLE  R2       
  8849962757  93973.28  54423.7  0.0759587  34301.51  NaN   NaN    0.9970425

Tuning parameter 'nrounds' was held constant at a value of 200
Tuning parameter 'max_depth' was held constant at a value of 5
Tuning parameter 'eta' was held constant at a value of 0.1
Tuning parameter 'gamma' was held constant at a value of 0.5
Tuning parameter 'colsample_bytree' was held constant at a value of 0.8
Tuning parameter 'min_child_weight' was held constant at a value of 2
Tuning parameter 'subsample' was held constant at a value of 1

The eXtreme Gradient Boosting model shows strong performance after centering and scaling. Evaluated with 10-fold cross-validation, it yields an MSE of 8,849,962,757 and an RMSE of 93,973.28. The MAE is 54,423.7 and the MAPE is 0.0759587, indicating low percentage errors. The MedAE is 34,301.51, and the R² of 0.9970425 suggests the model explains over 99% of the variance in the target variable. However, the MSLE and RMSLE values are NaN, most likely because some cross-validated predictions were negative, which makes log(1 + predicted) undefined; this prevents a direct comparison with the paper's reported RMSLE of 0.16118.

LightGBM Model

In this section we reproduce the results from the referenced article using the LightGBM model. LightGBM (Light Gradient Boosting Machine) is a fast, efficient gradient-boosting algorithm that handles large datasets well and is used for both classification and regression. It builds an ensemble of decision trees sequentially, each tree trained to correct the errors of the previous ones. Unlike algorithms that grow trees level by level, LightGBM grows trees leaf by leaf, which often results in more accurate models; the boosting process is guided by gradient descent to minimize the loss. Its many hyperparameters allow fine-tuning that can significantly improve performance, which makes LightGBM a popular choice for large-scale machine learning tasks.

The parameters used in the model are consistent with those specified in the article, ensuring comparability of results. Specifically, we optimized tree-specific parameters: min_child_weight = 2, num_leaves = 36, colsample_bytree = 0.8, learning_rate = 0.15. Additionally, we set the regularization parameter reg_lambda = 0.40.

Data Preparation for Modeling

We begin with data preparation for modeling: defining the features and the target variable, and splitting the dataset into training and testing sets.

```{r}
# Defining features and target 
features <- setdiff(names(housing_prices_data_clean), "totalPrice")
target <- "totalPrice"

# Splitting the data into training and testing sets 
set.seed(123456789)

data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice ,
                                        p = 0.8, 
                                        list = FALSE) 

train_data <- housing_prices_data_clean[data_which_train,]
test_data <- housing_prices_data_clean[-data_which_train,]

# Preparing data for LightGBM
# data.matrix() coerces the factor columns to their integer codes,
# since lgb.Dataset() requires a numeric matrix
trainMatrix <- lgb.Dataset(
  data = data.matrix(train_data[, -which(names(train_data) == "totalPrice")]),
  label = train_data$totalPrice)

testMatrix <- lgb.Dataset(
  data = data.matrix(test_data[, -which(names(test_data) == "totalPrice")]),
  label = test_data$totalPrice)
```

Model Parameters

The parameters for the LightGBM model are defined based on the article. Note that in LightGBM, colsample_bytree is an alias of feature_fraction, so setting both (0.8 and 0.9) is redundant and only one of them takes effect.

```{r}
# Defining LightGBM parameters based on the article 
params <- list(objective = "regression",
               metric = "rmse",
               min_child_weight = 2,
               num_leaves = 36,
               colsample_bytree = 0.8,
               reg_lambda = 0.40,
               learning_rate = 0.15,
               feature_fraction = 0.9 ) 
```

Training the Model

We train the LightGBM model with the specified parameters.

```{r message=FALSE, warning=FALSE, results = "hide"}
# Training the LightGBM model  
model_lgbm <- lgb.train(params,
                   trainMatrix,
                   nrounds = 1000,
                   valids = list(test = testMatrix),
                   early_stopping_rounds = 10) 
```

Making Predictions

We generate predictions and then clip negative values, since RMSLE is undefined for negative inputs.

```{r}
# Prediction (data.matrix again coerces factors to their integer codes)
test_predictions <- predict(model_lgbm, data.matrix(test_data[, -which(names(test_data) == "totalPrice")])) 

train_predictions <- predict(model_lgbm, data.matrix(train_data[, -which(names(train_data) == "totalPrice")]))  

# Clip negative predictions to a small positive value (required for RMSLE)
test_predictions <- ifelse(test_predictions < 0, 0.01, test_predictions)
train_predictions <- ifelse(train_predictions < 0, 0.01, train_predictions) 
```

Evaluation Metrics

We evaluate the model using RMSLE and MAPE metrics.
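For reference, both metrics (as implemented in the Metrics package, used below) reduce to simple formulas; a minimal sketch with toy vectors:

```{r}
# RMSLE: root mean squared error on log1p-transformed values
rmsle_manual <- function(actual, predicted)
  sqrt(mean((log1p(predicted) - log1p(actual))^2))

# MAPE: mean absolute percentage error
mape_manual <- function(actual, predicted)
  mean(abs((actual - predicted) / actual))

rmsle_manual(c(100, 200), c(110, 190))  # matches Metrics::rmsle
mape_manual(c(100, 200), c(110, 190))   # (0.10 + 0.05) / 2 = 0.075
```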

```{r}
# Evaluation metrics 

# RMSLE and MAPE for test data 
test_rmsle_score <- rmsle(test_data[[target]], test_predictions)
test_mape_score <- mape(test_data[[target]], test_predictions)  

# RMSLE and MAPE for train data 
train_rmsle_score <- rmsle(train_data[[target]], train_predictions)
train_mape_score <- mape(train_data[[target]], train_predictions)  

cat("Test RMSLE: ", test_rmsle_score, "\n") 
cat("Test MAPE: ", test_mape_score, "\n") 
cat("Train RMSLE: ", train_rmsle_score, "\n") 
cat("Train MAPE: ", train_mape_score, "\n") 
```
Test RMSLE:  0.1501456 
Test MAPE:  0.17798 
Train RMSLE:  0.1391439 
Train MAPE:  0.1163801 

Plotting Results

Finally, we plot the actual vs. predicted values for both training and testing data.

```{r}
#  Dataframe for plotting 

plot_test <- data.frame(
  Real = test_data[[target]],
  Predicted = test_predictions )

plot_train <- data.frame(
  Real = train_data[[target]],
  Predicted = train_predictions)  

# Plot helper (identical styling for the train and test panels)
pred_vs_actual_plot <- function(df, subtitle) {
  ggplot(df, aes(x = Predicted, y = Real)) +
    geom_point(alpha = 0.6, color = "lightblue") +
    geom_abline(slope = 1, intercept = 0, color = "darkorange", lwd = 0.7) +
    labs(
      x = "Light GBM Predictions",
      y = "Actual Price",
      subtitle = subtitle
    ) +
    scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
    scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
    theme_minimal() +
    theme(plot.title = element_blank())
}

test_p <- pred_vs_actual_plot(plot_test, "Test Data")
train_p <- pred_vs_actual_plot(plot_train, "Train Data")

grid.arrange(train_p, test_p, ncol = 2)
```

  • Our test RMSLE of 0.1501 is slightly better than the article’s test RMSLE of 0.1694.

  • Our train RMSLE of 0.1391 is also better than the article’s train RMSLE of 0.1669.

  • We also included MAPE as an additional metric, providing further insight into the model’s accuracy. Our test MAPE is 0.178, and train MAPE is 0.1164.

  • The number of estimators in our model was 387, significantly higher than the 64 estimators used in the article. This difference indicates that our model required more trees to achieve optimal performance, which might be due to differences in data preprocessing or parameter tuning.
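The round count reported above comes from early stopping; it can be read directly from the fitted booster (a minimal sketch, assuming `model_lgbm` from the training chunk and the lightgbm R package's R6 accessors):

```{r}
# Number of boosting rounds selected by early stopping,
# and the corresponding best validation RMSE
model_lgbm$best_iter
model_lgbm$best_score
```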

In conclusion, our LightGBM model performs comparably to the one in the referenced article, with a slightly better fit on the training data. The inclusion of MAPE gives us a clearer picture of prediction errors, emphasizing the model’s practical accuracy in terms of percentage error.

Hybrid Regression

In this section, we describe the creation and evaluation of a hybrid regression model that combines the predictions of three machine learning models: RandomForest, XGBoost, and LightGBM. Each model contributes equally to the final prediction, improving robustness and accuracy by leveraging the strengths of the individual models.

Loading Models and Generating Predictions

We begin by loading the pre-trained RandomForest and XGBoost models and sourcing the script that trains the LightGBM model. We then generate predictions for both the training and test datasets.

```{r message=FALSE, warning=FALSE, results = "hide"}
data_rf <- readRDS("data_rf_model.rds")
data_xgboost <- readRDS("data_xgboost_model.rds")

source("RR_HousingPricePred_LightGBM.R")

load("housing_prices_data_clean.Rdata")
set.seed(123456789)
data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice ,
                                        p = 0.8,
                                        list = FALSE) 

data_train <- housing_prices_data_clean[data_which_train,]
data_test <- housing_prices_data_clean[-data_which_train,]

pred_rf_test <- predict(data_rf, data_test)
pred_xgb_test <- predict(data_xgboost, data_test)
pred_lgb_test <- predict(model_lgbm, as.matrix(data_test[, -which(names(data_test) == "totalPrice")]))

pred_rf_train <- predict(data_rf, data_train)
pred_xgb_train <- predict(data_xgboost, data_train)
pred_lgb_train <- predict(model_lgbm, 
                          as.matrix(data_train[, -which(names(data_train) == "totalPrice")]))
```

Combining Predictions

The predictions from the three models are combined using a simple average. This approach ensures that each model’s prediction contributes equally to the final prediction.

```{r}
combined_predictions_test <- (pred_rf_test + pred_xgb_test + pred_lgb_test) / 3
combined_predictions_train <- (pred_rf_train + pred_xgb_train + pred_lgb_train) / 3

combined_predictions_test <- ifelse(combined_predictions_test < 0,
                                    0.01, combined_predictions_test)
combined_predictions_train <- ifelse(combined_predictions_train < 0, 
                                     0.01, combined_predictions_train)
```

Evaluating the Hybrid Model

We evaluate the performance of the hybrid model using two key metrics: Root Mean Squared Logarithmic Error (RMSLE) and Mean Absolute Percentage Error (MAPE). These metrics provide a clear picture of the model’s accuracy in predicting housing prices.

```{r}
combined_rmsle_test <- Metrics::rmsle(data_test$totalPrice,
                                      combined_predictions_test)
combined_mape_test <- Metrics::mape(data_test$totalPrice, 
                                    combined_predictions_test)

cat("Combined Test RMSLE: ", combined_rmsle_test, "\n")
cat("Combined Test MAPE: ", combined_mape_test, "\n")

combined_rmsle_train <- Metrics::rmsle(data_train$totalPrice,
                                      combined_predictions_train)
combined_mape_train <- Metrics::mape(data_train$totalPrice, 
                                    combined_predictions_train)

cat("Combined Train RMSLE: ", combined_rmsle_train, "\n")
cat("Combined Train MAPE: ", combined_mape_train, "\n")
```
Combined Test RMSLE:  0.1434729 
Combined Test MAPE:  0.1566243 
Combined Train RMSLE:  0.1370518 
Combined Train MAPE:  0.1172748 

We compare the results of our hybrid model with those reported in the referenced article. The article reported a Train RMSLE of 0.14969 and a Test RMSLE of 0.16372.

Our hybrid model achieved:

  • Train RMSLE: 0.1371

  • Test RMSLE: 0.1435

These results indicate that our hybrid model performs better than the article's model on both the training and test data.

Plotting the Results

Finally, we visualize the performance of the hybrid model by plotting the actual vs. predicted prices. The plot includes the identity line (y = x) to show how closely the predictions track the actual prices.

```{r}
tibble(
  pred = combined_predictions_test,
  actual = data_test$totalPrice
) %>% 
  ggplot(aes(pred, actual)) +
  geom_point(alpha = 0.6, color = "lightblue") + 
  geom_abline(slope = 1, intercept = 0, color = "darkorange", lwd = 0.7) +  
  labs(subtitle = "Test Data", x = "Hybrid Model Prediction", y = "Actual Price") +
  scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +  
  theme_minimal() +
  theme(plot.title = element_blank()) 
```

In summary, the hybrid regression model combines predictions from RandomForest, XGBoost, and LightGBM to improve the overall prediction accuracy. By averaging the predictions, we leverage the strengths of each model, resulting in a robust and reliable prediction system. Our hybrid model outperforms the referenced model, achieving lower RMSLE on both training and test datasets.

Stacked Generalization

Stacked generalization combines the outputs of individual estimators and uses a regressor to produce the final prediction. The technique leverages the strengths of each base estimator by using their outputs as inputs for a final estimator.

The core idea is to treat the predictions of the preceding models as features for another model. The approach also incorporates k-fold cross-validation to mitigate the risk of overfitting.

Following the article, we used a common two-level stacking architecture to predict housing prices. In the first level, Random Forest and LightGBM served as base learners, while XGBoost constituted the second level (the meta-learner). To accommodate the relatively large dataset, the stacking model was trained with 5-fold cross-validation.
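This setup can be sketched as follows. The sketch is illustrative rather than the exact code we ran: `ranger` stands in for the Random Forest, the fold indices come from caret's `createFolds`, and the `num.trees`/`nrounds` values are placeholders. It reuses `data_train` and the `params` list defined earlier.

```{r eval=FALSE}
library(caret); library(ranger); library(lightgbm); library(xgboost)

X <- as.matrix(data_train[, setdiff(names(data_train), "totalPrice")])
y <- data_train$totalPrice

set.seed(123456789)
folds <- createFolds(y, k = 5)

# Level 1: out-of-fold predictions from the two base learners,
# so the meta-learner never sees predictions made on training folds
oof_rf  <- numeric(length(y))
oof_lgb <- numeric(length(y))
for (idx in folds) {
  rf_fit <- ranger(x = X[-idx, ], y = y[-idx], num.trees = 100)
  oof_rf[idx] <- predict(rf_fit, data = X[idx, ])$predictions

  lgb_fit <- lgb.train(params, lgb.Dataset(X[-idx, ], label = y[-idx]),
                       nrounds = 100)
  oof_lgb[idx] <- predict(lgb_fit, X[idx, ])
}

# Level 2: XGBoost meta-learner trained on the out-of-fold predictions
meta_model <- xgboost(data = xgb.DMatrix(cbind(rf = oof_rf, lgb = oof_lgb),
                                         label = y),
                      objective = "reg:squarederror",
                      nrounds = 100, verbose = 0)

# Scoring data_test would use base learners refit on the full training
# data to supply the two meta-features
```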

Our Stacked Generalization model achieved:

  • Train RMSLE: 0.294

  • Test RMSLE: 0.304

These results indicate that our model performs worse than the article's model on both the training and test data.

Conclusions

  • Overall we obtained the expected results; for LightGBM and Hybrid Regression our metrics were even better than the article's.

  • For Random Forest, XGBoost, and Stacked Generalization we obtained worse results than the researchers'.

  • The training set exhibits lower error rates than the test set, suggesting a degree of overfitting.